We want to study an E-commerce dataset:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
file = pd.read_csv('data.csv',encoding="ISO-8859-1")
df = file.copy()
total_sales = df[['Description', 'Quantity', 'UnitPrice', 'CustomerID']].copy()
total_sales['Total Sales'] = total_sales['Quantity'] * total_sales['UnitPrice']
ten_products = (
total_sales.groupby('Description')['Total Sales']
.sum()
.sort_values(ascending=False)
.head(10)
.reset_index()
)
ten_products['Total Sales'] = ten_products['Total Sales'].round(2)
fig = px.bar(
ten_products,
x='Description',
y='Total Sales',
title='Top 10 products by Total Sales',
labels={
'Description': 'Product Description',
'TotalSales': 'Total Sales (in Dollars)'
},
text='Total Sales'
)
fig.update_layout(xaxis_tickangle=45)
fig.show()
The product with the highest total sales from the 'Top 10 products by total sales' was 'DOTCOM POSTAGE' with 206,245 and the lowest was 'CHILLI LIGHTS' with 53768.
total_sales = df[['Description', 'Quantity', 'UnitPrice', 'Country']].copy()
total_sales['Total Sales'] = total_sales['Quantity'] * total_sales['UnitPrice']
ten_countries = (
total_sales.groupby('Country')['Total Sales']
.sum()
.sort_values(ascending=False)
.head(10)
.reset_index()
)
ten_countries['Total Sales'] = ten_countries['Total Sales'].round(2)
fig = px.bar(
ten_countries,
x='Total Sales',
y='Country',
title='Top 10 Countries by Total Sales',
labels={
'Total Sales': 'Total Sales (in Dollars)',
'Country': 'Country'
},
text='Total Sales',
orientation='h'
)
fig.show()
The country that had the most sales was United Kingdom. Which had significantly more than any other country as it had a difference of 7 million. From the 'Top 10 countries by Total Sales' Sweden had the least by 36595.
total_quantity = df[['Description', 'Quantity', 'CustomerID']].copy()
six_products = (
total_quantity.groupby('Description')['Quantity']
.sum()
.sort_values(ascending=False)
.head(6)
.reset_index()
)
fig = px.pie(
six_products,
values='Quantity',
names='Description',
title='Distribution of Quantities Sold for Top 6 Products',
)
fig.show()
The top product sold by quantity was 'World War 2 Gliders Assstd Desgins' with it being 22% and the lowest was 'White Hanging Heart T-Light Holder' with it being at 14.4%
monthly_sales = df.copy()
monthly_sales['InvoiceDate'] = pd.to_datetime(monthly_sales['InvoiceDate'])
monthly_sales['Month'] = monthly_sales['InvoiceDate'].dt.to_period('M')
monthly_sales = (
monthly_sales.groupby('Month')
.apply(lambda x: (x['Quantity'] * x['UnitPrice']).sum())
.reset_index(name='Total Sales')
)
monthly_sales['Month'] = monthly_sales['Month'].astype(str)
fig = px.line(
monthly_sales,
x='Month',
y='Total Sales',
title='Total Monthly Sales',
labels={'Month': 'Month', 'Total Sales': 'Total Sales (in Dollars)'},
markers=True
)
# Display the chart
fig.show()
There was a peak on November 2011 with it being on 1.46 million. Then had a drop to 433k on December. I was a little caught by suprise on this because I would expect there to be more sales throughout the holidays.
top_products = df.groupby('Description')['Quantity'].sum().nlargest(6).index
tp_df = df[df['Description'].isin(top_products)]
fig = px.box(
tp_df,
x='Description',
y='UnitPrice',
title='Distribution of UnitPrice for Top 6 Products by Quantity Sold',
labels={'Description': 'Product Description', 'UnitPrice': 'Unit Price ($)'},
)
fig.show()
The 'White Hanging Heart T-Light Holder' had the greatest max at 6.77. The second highest was the 'Pack of 72 Retrospot Cake Cases' with a max of 5. The one that had the lowest distribution was 'World War 2 Gliders Asstd Designs'
fig = px.scatter(
df,
x='UnitPrice',
y='Quantity',
title='Correlation between Unit Price and Quantity',
labels={
'UnitPrice':'Unit Price (in dollars)',
'Quantity':'Quantity Sold'
},
trendline='ols',
)
fig.show()